feat: Add session resilience and context budget management#9
feat: Add session resilience and context budget management#9mdear wants to merge 9 commits intoIntelligent-Internet:mainfrom
Conversation
This commit introduces two major features: WebSocket session resilience for surviving temporary disconnects, and proactive context budget management to prevent context window overflow. ## Session Resilience (Backend + Frontend) - JWT-based session security with HttpOnly fingerprint cookie binding (RFC 8725) - Dual heartbeat mechanism (server 30s ping + client 20s heartbeat) - 120-second reconnection grace period with event buffering - Automatic token refresh at 90% of JWT lifespan - Run state preservation during disconnects New files: - core/api/connection_manager.py - Connection state and event buffering - core/api/session_security.py - JWT lifecycle and fingerprint binding - frontend/lib/sessionManager.ts - Client-side session management ## Context Budget Management - Provider-aware token counting (Anthropic, OpenAI, Google APIs) - Circuit breaker pattern: warning at 40%, force completion at 55% - Budget-aware content selection for work module inheritance - Per-worker budget allocation for parallel execution New files: - core/agent_core/llm/token_counter.py - Accurate provider-specific counting - core/agent_core/framework/context_budget_guardian.py - Threshold monitoring - core/agent_core/utils/content_selection.py - Budget-aware inheritance ## Test Coverage - 878 backend unit tests (33 new test files) - 25 frontend tests for session management - Coverage for all new modules ## Documentation - docs/architecture/session-resilience.md - Full design specification - docs/architecture/context-budget-management.md - Budget system design - docs/guides/04-debugging.md - Added CLI tools documentation - scripts/analyze_session.py - Session analysis utility - scripts/commonground.sh - Service manager script ## Other Changes - Graceful shutdown with connection cleanup - Updated pyproject.toml with uv export instructions - Anthropic-specific LLM configs for accurate token budgeting - Agent profile updates for budget-aware operation
|
Hi, team, here are some stability and resilience fixes that I did to support integration of my own MCP server (proprietary knowledge base for wheelchair seating/mobility, which is capable of quickly overwhelming a model's context without proper controls). My strengths lie mostly in backend infrastructure, so I kept my frontend changes light, really only enough so I could have enough stability to be able to properly evaluate this solution. I introduced unit test infras, capturing all backend current behavior. Respect! This is my way of showing in a (hopefully) useful way that I support you and what you are trying to do. Any and all constructive criticism/review/suggestions are welcome. |
…wareness Context Budget System: - Add context_admission_controller for pre-admission budget enforcement - Add context_budget_handback for Principal-delegated summarization - Update thresholds: WARNING 60%, CRITICAL 75%, EXCEEDED 85% - Implement agent-type-aware forcing (Principal/Associate only) - Partner agents receive guidance only (no flow-ending tools) Orphan Detection: - Add detect_orphaned_tool_interactions() to turn_manager - Add finalize_orphaned_tool_interactions() for recovery - Add detect_dispatch_anomalies() to dispatcher_node Session Analysis (analyze_session.py): - Add --mode handoff/thrashing/errors analysis modes - Fix analyze_work_modules() to aggregate ALL context_archive entries - Add dispatch_count tracking for thrashing detection - Improve error detection to avoid false positives Bug Fixes: - Fix DuckDBRAGStore unawaited coroutine warning (lazy init) - Rename test_jina_* to check_jina_* to avoid pytest auto-discovery - Remove unused pythonjsonlogger import (deprecation warning) - Fix sessionManager to always create fresh session_id for WS Frontend: - Increase node fallback dimensions for better visual fit - Fix sessionManager reconnection flow Docs: - Update context-budget-management.md with implementation status Tests: 934 passed, 1 skipped
Flow visualization improvements: - Add dynamic minZoom that adapts to card count (see all cards at min zoom) - Fix maxZoom at 1.5x for readable card text regardless of card count - Align scroll wheel zoom speed between minimap and canvas (~9 clicks) - Add translateExtent to constrain panning within node bounds - Add status-based MiniMap colors (blue=running, green=success, red=error) Scroll and layout fixes: - Fix page-level scrolling by adding overflow:hidden to html/body/SidebarProvider - Fix auto-scroll on page load (scrollIntoView block:'nearest') - Add overscroll-contain to ChatHistory to prevent scroll chaining Swim lane layout (flow-utils.ts): - Rewrite layout algorithm for fixed-width swim lanes per agent - Increase node fallback dimensions for better readability - Add minimum dimension enforcement in getNodeSize() Files changed: - FlowView.tsx: zoom config, MiniMap styling, ReactFlowProvider wrapper - ChatLayout.tsx: overflow-hidden on panels - Workspace.tsx: overflow-hidden on container - ChatHistory.tsx: overscroll-contain - flow-utils.ts: swim lane algorithm - globals.css: html/body overflow hidden - layout.tsx: SidebarProvider height constraints - r/page.tsx: scrollIntoView fix
Associates that output JSON deliverables without calling `finish_flow`
would have their work lost, as the system only triggers deliverable
extraction when `finish_flow` is invoked. Live session analysis
revealed this caused re-dispatching.
The `generate_message_summary` instructional prompt told agents "DO NOT
call any tools" after outputting JSON, but `finish_flow` IS required to
trigger `_extract_deliverables_from_messages()` and capture the work.
- Updated instructional prompt to explicitly describe the 3-response
sequence: generate_message_summary → JSON output → finish_flow
- Added critical warning about deliverable capture requirement
- Added "Finish Protocol" section documenting the completion sequence
- Updated self-reflection to detect JSON-without-finish_flow state
- Fixed observation text to match actual trigger conditions
- Added "CRITICAL CHECK" for JSON deliverable detection
- Updated instructions to guide agents through finish protocol
- Fixed incomplete sentence ("MUST synthesis" → proper guidance)
- Updated Deliver step to mention `finish_flow` requirement
- Analyzed production runs confirming the JSON → finish_flow
sequence across all completed work modules
- All unit tests pass
- No regressions expected - changes are corrective/additive
Flow visualization now groups disconnected subgraphs into time-sorted epochs, ensuring timestamps always flow top-to-bottom (swimlane style). Changes: - Detect epochs via flood-fill of disconnected turn subgraphs - Sort epochs by earliest timestamp for chronological ordering - Add epoch separator nodes between epochs with proper labels - Add "Epoch 1" header when multiple epochs exist - Create edges connecting separators to adjacent epoch roots/leaves - Filter Partner and user_turn before epoch detection - Add epoch_separator nodeType to frontend FlowView component - Update FlowViewModel documentation in API reference - Add comprehensive unit tests for epoch detection logic Fixes issue where cards from re-dispatched work modules appeared out of chronological order in the flow visualization.
- Detect disconnected subgraphs as epochs, sort by timestamp - Add epoch separator nodes with edges to adjacent epochs - Show "Epoch N" headers only when multiple epochs exist - Filter Partner/user_turn before epoch detection - Update frontend to render epoch_separator nodeType - Update API docs for FlowViewModel epoch fields
Core Fixes:
- Add tool conflict resolution to prioritize finish_flow when called
with other tools (prevents silent data loss from juicy-winged-adder)
- Add synchronous completion handler ensuring deliverables propagate
to Partner inbox before session save
- Add /api/reports/{project_id}/{filename} endpoint for report downloads
Port Configuration:
- Consolidate all port config to core/.env as single source of truth
- Update commonground.sh to read BACKEND_PORT/FRONTEND_PORT from .env
- Update run_server.py to use .env defaults for host/port
- Update analysis scripts to read from .env instead of hardcoding
Pagination Enhancements:
- Add work module archive pagination to get_paginated_run_snapshot
- Support work_module_id and archive_index params for deep inspection
- Update message_handlers.py to pass new pagination params
- Update live_session_query.py to reconstruct archives with pagination
Frontend:
- Fix heartbeat/reconnect message format to use data wrapper
- Increase heartbeat tolerance (30s interval, 4 missed max)
Tests & Docs:
- Add test_tool_conflict_resolution.py (10 tests)
- Add TestTeamStatePagination tests (7 tests)
- Add context-management-fixes.md design document
- Restore DEFAULT_API_KEY/DEFAULT_BASE_URL in env.sample
…tion - Add `allowed_at_critical` parameter to tool registry for budget-aware filtering - User prompts (USER_PROMPT, PARTNER_DIRECTIVE, PRINCIPAL_COMPLETED) bypass circuit breaker to use reserved 15% headroom - Restrict Partner tools to read-only at CRITICAL/EXCEEDED thresholds - Mark flow-terminating tools (finish_flow, generate_message_summary) and GetPrincipalStatusSummaryTool as critical-safe - Add filter_tools_for_critical_budget() in agent_strategy_helpers - Add --output flag to analyze_session.py and live_session_query.py Tests: 13 new tests for strategy helpers, 4 for guardian bypass, 6 for registry
This commit introduces two major features: WebSocket session resilience for surviving temporary disconnects, and proactive context budget management to prevent context window overflow.
Session Resilience (Backend + Frontend)
New files:
Context Budget Management
New files:
Test Coverage
Documentation
Other Changes